Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

De Novo Genome Assembly ◾ 107

for the actual genomes. This is because sequencing errors are hard to avoid and the genomes of many organisms include repetitive sequences. However, assembly accuracy can be increased by deep sequencing, paired-end sequencing, and the use of long reads produced by PacBio and Oxford Nanopore Technologies. De novo genome assembly has been widely used for different kinds of organisms, especially in metagenomics for the assembly of bacterial, fungal, and viral genomes.

We can use either a greedy alignment approach or graph-based approaches for de novo genome assembly. The greedy method works like multiple sequence alignment: it performs pairwise alignments and repeatedly merges the best-overlapping reads to build up contigs. The graph-based approach can use either overlap-layout-consensus (OLC) graphs or de Bruijn graphs. In overlap graphs, reads are represented as nodes and the overlaps between reads as edges. Contigs are built by finding a Hamiltonian path, which visits each node exactly once. De Bruijn graphs, on the other hand, break the reads into k-mers, which are represented as nodes; an edge connects two k-mers that overlap by k-1 bases. Contigs are formed by following an Eulerian path, which traverses each edge exactly once. De Bruijn graphs are generally more efficient than overlap graphs for large sets of short reads, because an Eulerian path can be found in time linear in the number of edges, whereas finding a Hamiltonian path is NP-complete. The assemblers that use de Bruijn graphs include ABySS for small and large genomes and SPAdes for small bacterial, fungal, and viral genomes. SPAdes has several modules, such as a metagenomic module, a viral assembly module, and a SARS-CoV-2 module. SPAdes can also assemble a genome from hybrid reads, such as Illumina reads combined with PacBio or Oxford Nanopore reads.
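The de Bruijn idea described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how ABySS or SPAdes are implemented: the reads are error-free, come from one strand, and every k-mer occurs only once in the genome. The function names (build_de_bruijn, eulerian_path, spell_contig) and the three example reads are hypothetical and chosen only for illustration. Nodes are k-mers, an edge joins two k-mers that follow each other in a read, and Hierholzer's algorithm is used to walk every edge once.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that are consecutive in a read.

    Simplification: duplicate edges are collapsed, so k-mer multiplicity is ignored.
    """
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for src, dst in sorted(edges):
        graph[src].append(dst)
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(graph),
    )
    remaining = {n: list(targets) for n, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finished
    return path[::-1]


def spell_contig(path):
    """Join the k-mers on the path: first k-mer plus the last base of each next one."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads from the toy sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=3)
contig = spell_contig(eulerian_path(graph))
print(contig)  # ATGGCGTGCAA
```

Real assemblers additionally handle sequencing errors, reverse complements, repeats, and k-mer coverage, all of which this sketch leaves out.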

After assembly, a genome can be assessed using both statistical and evolutionary methods. The statistical method reports the number, lengths, and length distribution of contigs. An assembly with few but long contigs generally indicates good quality. Metrics used to describe statistical quality are N25, N50, N75, L25, L50, and L75. For example, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length, and L50 is the number of contigs in that set. We can compare the performance of assemblers using these metrics. The evolutionary assessment of an assembly relies on genes conserved in evolutionarily related species to determine how many of the expected genes are present in the assembled genome. The completeness of the genome assembly is estimated from the proportion of these expected genes that are identified as complete. For statistical quality assessment, we can use QUAST, and for evolutionary assessment, we can use BUSCO.
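To make the N and L metrics concrete, the following Python sketch computes them from a list of contig lengths. It assumes the usual definition (sort contigs from longest to shortest and accumulate lengths until the chosen fraction of the total assembly length is reached). The function name nx_lx and the contig lengths are hypothetical, and tools such as QUAST report these values directly.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction, e.g. 0.5 gives (N50, L50)."""
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            # length is Nx; count is Lx (number of contigs needed to reach it)
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (in kb) with a total of 300.
lengths = [80, 70, 50, 40, 30, 20, 10]
n25, l25 = nx_lx(lengths, 0.25)
n50, l50 = nx_lx(lengths, 0.50)
n75, l75 = nx_lx(lengths, 0.75)
print(n25, l25)  # 80 1
print(n50, l50)  # 70 2
print(n75, l75)  # 40 4
```

A larger N50 together with a smaller L50 means that fewer, longer contigs cover half of the assembly, which is why these values are convenient for comparing assemblers on the same data set.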
